feat(auto-routing): benchmark-driven decision engine and kilo-auto/efficient #3982

Open
iscekic wants to merge 75 commits into main from feat/auto-routing-efficient-decision-engine

iscekic commented Jun 11, 2026

Benchmark-driven decision engine and kilo-auto/efficient

Summary

Adds a benchmark-driven model-routing pipeline behind a new hidden virtual model, kilo-auto/efficient: route each request to the cheapest model that is proven (by our own benchmarks) to be accurate enough for the request's difficulty.

Three moving parts:

  • services/auto-routing-benchmark (new Cloudflare Worker): runs two deterministic benchmarks — classifier prompt replay via OpenRouter, and decider golden tasks through the real kilo CLI in a Cloudflare Container — writes normalized results to D1, and publishes a routing table (per-difficulty-tier ranked candidates) plus a classifier winner.
  • services/auto-routing (existing worker): /decide classifies the request, derives a difficulty tier, and picks the cheapest above-threshold model from the routing table, with session-sticky decisions held in a Durable Object.
  • apps/web (gateway): exposes kilo-auto/efficient, blocks on /decide with a 2s timeout, falls back to the balanced Qwen default, bills the classifier LLM cost to the requesting user, and adds an admin panel for the whole pipeline.

Shared classifier code (prompt, parsing, fallback, taxonomy, tier derivation, routing-table schema) moves into the new packages/auto-routing-contracts package, so the benchmark replays exactly the code the production worker executes.

Architecture

client request (model = kilo-auto/efficient)
        │
        ▼
┌────────────────────────────┐  POST /decide (2s timeout;      ┌─────────────────────────────────┐
│ apps/web gateway           │  null/timeout → balanced Qwen)  │ services/auto-routing           │
│ · resolves kilo-auto/*     │ ───────────────────────────────▶│ · classify request (LLM)        │
│ · applies pinned           │                                 │ · derive difficulty tier        │
│   reasoningEffort          │                                 │ · cheapest above-threshold pick │
│ · bills classifier cost    │                                 │ · sticky decision per           │
│ · admin panel              │                                 │   conversation (DO)             │
└──────────┬─────────────────┘                                 └──────────────┬──────────────────┘
           │ admin proxy +                                     routing table & classifier winner:
           │ 6h token mint                                     isolate cache 60s → KV 1h →
           │ (internal secret)                                 service binding origin
           ▼                                                                  ▼
┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ services/auto-routing-benchmark (new worker)                                                   │
│ · /admin/config /admin/runs /admin/routing-table /admin/classifier-winner /admin/debug-cli     │
│ · Queue (+DLQ) fans out per-model jobs                                                         │
│ · classifier bench: 72-case prompt replay via OpenRouter                                       │
│ · decider bench: 76 golden tasks via `kilo` CLI in a Cloudflare Container                      │
│ · D1 (drizzle, normalized): runs, case results, summaries, published routing tables            │
└───────────────────────────────────────────────────────────────────────────────────────────────┘

Benchmark worker (services/auto-routing-benchmark)

Classifier benchmark

Replays 72 normalized classifier inputs through OpenRouter using the exact production classifier code (@kilocode/auto-routing-contracts/classifier). Each output is graded per-field against a hand-labeled expectation via CLASSIFIER_FIELD_WEIGHTS (src/grading.ts): taskType 0.25, reasoningComplexity 0.20, contextComplexity 0.15, executionMode 0.15, subtaskType 0.10, requiresTools 0.10, riskLevel 0.05. Heuristic-fallback outputs score 0.
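
A minimal sketch of the weighted per-field grading described above. The weights are the ones listed; the `gradeClassification` helper, the `Classification` shape, and the assumption that each field is matched all-or-nothing by strict equality are illustrative and not taken from src/grading.ts.

```typescript
// Weights as listed in the description; they sum to 1.
const CLASSIFIER_FIELD_WEIGHTS = {
  taskType: 0.25,
  reasoningComplexity: 0.2,
  contextComplexity: 0.15,
  executionMode: 0.15,
  subtaskType: 0.1,
  requiresTools: 0.1,
  riskLevel: 0.05,
} as const;

type ClassifierField = keyof typeof CLASSIFIER_FIELD_WEIGHTS;

// Hypothetical shape: one value per graded field.
type Classification = Record<ClassifierField, string | boolean>;

// Hypothetical helper: sums the weights of the fields that match the label.
function gradeClassification(
  actual: Classification,
  expected: Classification,
  usedHeuristicFallback: boolean,
): number {
  // Heuristic-fallback outputs score 0 regardless of their content.
  if (usedHeuristicFallback) return 0;
  let score = 0;
  for (const field of Object.keys(CLASSIFIER_FIELD_WEIGHTS) as ClassifierField[]) {
    // Assumption: a field either matches exactly or earns nothing.
    if (actual[field] === expected[field]) score += CLASSIFIER_FIELD_WEIGHTS[field];
  }
  return score;
}
```

Under these assumptions, an output that gets everything right except riskLevel scores 0.95.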

The winner (src/winner.ts) is the cheapest model that meets the run's accuracy threshold, or the most accurate model if none does. It feeds the worker's classifier-model resolution chain (below).
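
The selection rule can be sketched as follows. `ClassifierSummary` and `pickWinner` are hypothetical names for illustration; the real src/winner.ts may break ties differently and, per later commits in this PR, also applies a latency budget that this sketch omits.

```typescript
// Hypothetical per-model aggregate for one classifier run.
interface ClassifierSummary {
  model: string;
  accuracy: number; // 0..1
  costUsd: number;
}

// Cheapest model at or above the threshold; otherwise the most accurate one.
function pickWinner(
  summaries: ClassifierSummary[],
  minAccuracy: number,
): ClassifierSummary | null {
  if (summaries.length === 0) return null;
  const passing = summaries.filter((s) => s.accuracy >= minAccuracy);
  if (passing.length > 0) {
    return passing.reduce((best, s) => (s.costUsd < best.costUsd ? s : best));
  }
  // Nobody meets the threshold: fall back to the most accurate model.
  return summaries.reduce((best, s) => (s.accuracy > best.accuracy ? s : best));
}
```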

Decider benchmark

Runs 76 golden tasks per candidate model through the real kilo CLI (@kilocode/cli) inside a Cloudflare Container (container/Dockerfile + container/server.mjs, node:22-slim, standard-2). Grading is purely mechanical — exact / contains_all / regex / json_equal checks, no LLM judges; golden answers were hand-derived and mechanically re-verified where executable. Cases include genuinely agentic tasks performed with file/terminal tools inside the container (deterministic: no repo, no network).

Execution details:

  • One queue message per (model, 10-case chunk); each chunk gets its own container instance (runId:model:chunk) so models/chunks never share state. CLI runs are serialized per instance (the CLI's sqlite state is not safe under concurrent first runs); a /warmup endpoint absorbs the one-time sqlite migration before the case loop.
  • Each candidate's pinned reasoningEffort is forwarded as the CLI's --variant, so the benchmark measures the model exactly as it will be served.
  • The CLI authenticates as a real Kilo user: the worker mints a short-lived token once per queue message via apps/web's internal endpoint (token only ever lives in a child-process env var, never logged or written to disk).
  • Empty-output sessions (exit 0, no assistant text) are retried once, mirroring the production classifier's retry policy; costs of both attempts are summed.
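
The fan-out in the first bullet can be sketched as below. `DeciderJob` and `buildDeciderJobs` are hypothetical names, and the message shape is an assumption; only the 10-case chunking and the runId:model:chunk instance id come from the description.

```typescript
// Hypothetical queue message: one per (model, chunk).
interface DeciderJob {
  runId: string;
  model: string;
  chunk: number;
  caseIds: string[];
  instanceId: string; // container instance id, unique per (run, model, chunk)
}

function buildDeciderJobs(
  runId: string,
  models: string[],
  caseIds: string[],
  chunkSize = 10,
): DeciderJob[] {
  const jobs: DeciderJob[] = [];
  for (const model of models) {
    for (let start = 0, chunk = 0; start < caseIds.length; start += chunkSize, chunk++) {
      jobs.push({
        runId,
        model,
        chunk,
        caseIds: caseIds.slice(start, start + chunkSize),
        // Distinct ids mean models and chunks never share a container.
        instanceId: `${runId}:${model}:${chunk}`,
      });
    }
  }
  return jobs;
}
```

With 76 cases and a chunk size of 10, each model yields 8 messages: seven full chunks and one chunk of 6 cases.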

Datasets

Both datasets cover all 18 (taskType, subtaskType) taxonomy pairs with at least 4 cases per pair — enforced by tests (src/datasets/*.test.ts). Decider cases each carry exactly one difficulty tier with at least 4 distinct task types per tier.

D1 schema (src/db-schema.ts)

Fully normalized, zero JSON blob columns, composite-PK-only access:

  • benchmark_config + config_classifier_models + config_decider_models — admin config (incl. per-decider-model reasoning_effort).
  • benchmark_runs — carries a config snapshot (min_accuracy, switch_cost_factor, max_concurrency, benchmark_user_id) taken at startRun time, so mid-run admin edits can't skew results. All job processing and publishing reads the snapshot, never live config.
  • run_models — which models were enqueued vs. skipped, with the pinned reasoning_effort snapshot.
  • case_results — per (run, model, case) score/latency/cost plus diagnostics (classifier fallback reason, CLI exit code/output prefix/event tail).
  • model_summaries — per (run, model, tier) aggregates. Carried summaries: models with prior results are skipped on new runs (their latest summaries are copied in with carried=true), so re-runs only spend on new candidates; the admin can force a full re-run.
  • routing_tables + routing_table_candidates — published tables, queryable history.

Single squashed baseline migration (migrations/0000_amused_shard.sql), applied by a predeploy script (wrangler d1 migrations apply --remote) which the CI deploy workflow now runs for any worker that defines one (.github/workflows/deploy-workers.yml).

Publishing

On run completion the worker builds the routing table from the run's own snapshot (src/routing-table-builder.ts): per tier, candidates are ranked best-bang-for-buck (above-threshold cheapest-first, below-threshold by accuracy). Models with zero graded cases or no cost signal in a tier are excluded; if any tier ends up empty the publish is skipped and the previous table stays live (schema enforces .min(1) per tier). Publishing only deletes the KV cache keys so the auto-routing worker repopulates from D1 on the next read.
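
A rough sketch of the per-tier ranking and the skip-on-empty rule. `TierSummary`, `rankTier`, and `buildRoutingTable` are hypothetical names, and the field names are assumptions rather than the schema in src/routing-table-builder.ts.

```typescript
// Hypothetical per (model, tier) aggregate.
interface TierSummary {
  model: string;
  accuracy: number; // 0..1
  gradedCases: number;
  avgCostUsd: number | null; // null when there is no cost signal
}

function rankTier(summaries: TierSummary[], minAccuracy: number): TierSummary[] {
  // Exclude models with zero graded cases or no cost signal in this tier.
  const eligible = summaries.filter((s) => s.gradedCases > 0 && s.avgCostUsd !== null);
  // Above-threshold candidates: cheapest first.
  const above = eligible
    .filter((s) => s.accuracy >= minAccuracy)
    .sort((a, b) => (a.avgCostUsd ?? 0) - (b.avgCostUsd ?? 0));
  // Below-threshold candidates: most accurate first.
  const below = eligible
    .filter((s) => s.accuracy < minAccuracy)
    .sort((a, b) => b.accuracy - a.accuracy);
  return [...above, ...below];
}

// Returns null when any tier ends up empty, meaning: skip the publish.
function buildRoutingTable(
  byTier: Record<string, TierSummary[]>,
  minAccuracy: number,
): Record<string, TierSummary[]> | null {
  const table: Record<string, TierSummary[]> = {};
  for (const [tier, summaries] of Object.entries(byTier)) {
    const ranked = rankTier(summaries, minAccuracy);
    if (ranked.length === 0) return null;
    table[tier] = ranked;
  }
  return table;
}
```

Returning null instead of a partial table mirrors the rule that the previous table stays live when a tier has no candidates.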

Decision engine (services/auto-routing)

/decide (existing endpoint, now decision-capable):

  1. Classifies the request (per-conversation classification cache in a Durable Object, keyed by classifier model + content hash).
  2. Derives a difficulty tier (deriveDifficultyTier in contracts: reasoning complexity dominates at 2x weight; context, execution mode, and risk nudge borderline cases).
  3. Picks from the routing table (src/decision-engine.ts): cheapest above-threshold candidate for the tier — unless the session has an incumbent.

Session stickiness: the conversation's Durable Object remembers the last served model. The incumbent is kept while it still meets the tier's accuracy threshold, unless the fresh pick is cheaper by more than the table's switchCostFactor. Rationale (commented in code): a model switch discards the provider's prompt cache, and rebuilding it costs full-price input tokens (4–10x cache-read rates) on a context that dominates agent-session spend — switching only pays off when recurring per-turn savings clearly exceed that one-time penalty. Sticky state trusts only real classifier output: heuristic fallbacks never re-anchor the session's model.
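
The stickiness rule might look like the sketch below. Two points are assumptions, not confirmed by the description: "cheaper by more than switchCostFactor" is read as a cost ratio (fresh cost times the factor is still below the incumbent's cost), and the function returns null when no candidate meets the threshold. `Candidate` and `decideModel` are hypothetical names.

```typescript
// Hypothetical routing-table candidate for one tier.
interface Candidate {
  model: string;
  accuracy: number; // 0..1
  costUsd: number;
}

function decideModel(
  candidates: Candidate[],
  minAccuracy: number,
  switchCostFactor: number,
  incumbentModel: string | null,
): { model: string; sticky: boolean } | null {
  const passing = candidates.filter((c) => c.accuracy >= minAccuracy);
  if (passing.length === 0) return null; // simplification for this sketch
  const fresh = passing.reduce((best, c) => (c.costUsd < best.costUsd ? c : best));
  // The incumbent only counts if it still meets this tier's threshold.
  const incumbent = passing.find((c) => c.model === incumbentModel);
  if (incumbent !== undefined) {
    // Assumed reading: switch only when the incumbent costs more than
    // switchCostFactor times the fresh pick.
    const freshIsMuchCheaper = fresh.costUsd * switchCostFactor < incumbent.costUsd;
    if (!freshIsMuchCheaper) return { model: incumbent.model, sticky: true };
  }
  return { model: fresh.model, sticky: false };
}
```

With a factor of 3, an incumbent that costs twice the fresh pick is kept, while one that costs five times as much is replaced.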

Routing table access: read-through chain — isolate-local 60s TTL cache → KV (1h TTL, shared AUTO_ROUTING_CONFIG namespace) → service binding to the benchmark worker's D1-backed /admin/routing-table. Corrupt KV values are treated as misses; origin failures degrade to null (no decision) rather than erroring the request.
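
The read-through chain can be sketched generically as below. `KvLike` is a minimal stand-in for the KV binding (get, and put with an expirationTtl in seconds); `readThrough`, `parse`, and `fetchOrigin` are hypothetical names, and the real worker uses its own shared helper.

```typescript
// Minimal stand-in for a KV namespace binding.
interface KvLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}

// Isolate-local cache: lives only as long as this worker isolate.
const isolateCache = new Map<string, { value: unknown; expiresAt: number }>();

async function readThrough<T>(
  key: string,
  kv: KvLike,
  fetchOrigin: () => Promise<T>,
  parse: (raw: unknown) => T | null, // returns null for invalid shapes
  now: () => number = Date.now,
): Promise<T | null> {
  const hit = isolateCache.get(key);
  if (hit !== undefined && hit.expiresAt > now()) return hit.value as T;

  let value: T | null = null;
  const raw = await kv.get(key).catch(() => null);
  if (raw !== null) {
    try {
      value = parse(JSON.parse(raw));
    } catch {
      value = null; // corrupt KV value is treated as a miss
    }
  }
  if (value === null) {
    try {
      value = await fetchOrigin();
    } catch {
      return null; // origin failure degrades to "no decision"
    }
    await kv.put(key, JSON.stringify(value), { expirationTtl: 3600 }).catch(() => undefined);
  }
  isolateCache.set(key, { value, expiresAt: now() + 60_000 });
  return value;
}
```

Failed lookups are deliberately not cached in this sketch, so the next request retries the origin.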

Classifier model resolution (src/classifier-config.ts): admin KV override → benchmark winner (same KV read-through, derived on read) → built-in default google/gemini-2.5-flash-lite. A benchmark-origin failure never discards a healthy override.
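
A small sketch of the resolution order, assuming both sources are read concurrently. `resolveClassifierModel` and its two reader parameters are hypothetical; the point shown is that a failing benchmark read is caught on its own, so it cannot take a healthy override down with it.

```typescript
const DEFAULT_CLASSIFIER_MODEL = "google/gemini-2.5-flash-lite";

async function resolveClassifierModel(
  readOverride: () => Promise<string | null>,
  readBenchmarkWinner: () => Promise<string | null>,
): Promise<string> {
  const [override, winner] = await Promise.all([
    readOverride(),
    // Guard only the benchmark read: its failure means "no winner", not an error.
    readBenchmarkWinner().catch(() => null),
  ]);
  return override ?? winner ?? DEFAULT_CLASSIFIER_MODEL;
}
```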

Gateway (apps/web)

  • kilo-auto/efficient (src/lib/ai-gateway/auto-model/index.ts): hidden virtual model (excluded from /models, usable by id) with the same catalog properties as balanced — intended to eventually replace it, hidden while validated on Kilo team traffic.
  • Resolution (auto-model/resolution.ts + auto-routing-decision.ts): blocks on /decide with a 2s timeout; on a decision, serves the decided model and applies its pinned reasoningEffort so it runs under the same conditions the benchmark measured. On null/timeout/error, serves BALANCED_QWEN_MODEL — an efficient request never degrades below balanced.
  • Billing: the classifier LLM cost returned by /decide is billed to the requesting user as a separate microdollar usage row (requested_model: kilo-auto/efficient), so routing overhead is visible and attributed rather than absorbed.
  • Admin panel (admin/auto-routing/BenchmarksSection.tsx, proxied through admin API routes with the internal secret): config editor (classifier/decider model lists, per-decider reasoningEffort, minAccuracy, switchCostFactor, maxConcurrency, benchmarkUserId), run triggers with a force-rerun toggle, run history, and the live published routing table.
  • Config-save validation (admin/api/auto-routing/benchmark-config/route.ts): every decider model must be servable on all gateway chat API kinds (chat_completions, responses, messages) by the provider the gateway would route it to — the routing table deliberately carries no per-protocol metadata, so this invariant is enforced at write time.
  • Token mint (api/internal/auto-routing-benchmark/token/route.ts): POST gated by INTERNAL_API_SECRET; mints a 6h full user API token (tokenSource: auto-routing-benchmark) for the decider CLI's identity/billing.
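
The blocking /decide call with a 2s budget and a balanced fallback could be sketched as follows. `resolveEfficientModel` and the `Decision` shape are hypothetical, and the value of `BALANCED_QWEN_MODEL` here is a placeholder, not the real model id.

```typescript
// Placeholder value: the real constant lives in the gateway code.
const BALANCED_QWEN_MODEL = "balanced-qwen-placeholder";

interface Decision {
  model: string;
  reasoningEffort?: string; // pinned effort from the routing table, if any
}

async function resolveEfficientModel(
  decide: (signal: AbortSignal) => Promise<Decision | null>,
  timeoutMs = 2000,
): Promise<Decision> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    // Resolves to null once the timeout aborts the request.
    const timedOut = new Promise<null>((resolve) => {
      controller.signal.addEventListener("abort", () => resolve(null));
    });
    const decision = await Promise.race([decide(controller.signal), timedOut]);
    // A null decision and a timeout both fall back to balanced.
    return decision ?? { model: BALANCED_QWEN_MODEL };
  } catch {
    // Errors from /decide also fall back instead of failing the request.
    return { model: BALANCED_QWEN_MODEL };
  } finally {
    clearTimeout(timer);
  }
}
```

Passing the signal into `decide` lets the caller cancel the underlying fetch when the budget expires.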

Design properties

  • No fabricated data anywhere. There is no default routing table: /decide returns null decisions until a benchmark publishes one, and the gateway serves balanced fallbacks. There is no default benchmark config: runs refuse to start until an admin saves one (and decider runs additionally fail fast without a benchmarkUserId).
  • Deterministic, reproducible grading. Mechanical checks only; run-level config snapshots; routing tables built from the run's snapshot, not live config.
  • Cheap iteration. Carried summaries mean adding one candidate model re-benchmarks only that model; config-only changes (model removed, threshold tweaked) republish instantly with zero spend.
  • Graceful degradation at every layer. Corrupt KV → miss; origin failure → previous behavior; classifier failure → no decision → balanced fallback; publish failure → previous table stays live.

Infrastructure

  • D1 auto-routing-benchmark in region EEUR, primary in Frankfurt (colo FRA — next to the backend; verified via wrangler d1 info).
  • Queue auto-routing-benchmark-jobs (max_concurrency 4, max_retries 2) + DLQ auto-routing-benchmark-dlq.
  • Container app auto-routing-benchmark-runner (standard-2, max 40 instances), image built and pushed by wrangler deploy.
  • Service binding auto-routing → auto-routing-benchmark; shared KV namespace AUTO_ROUTING_CONFIG.
  • Both workers already run this branch's code; the D1 database is empty pending admin setup.

Post-merge deploy / cutover checklist

  1. Merge → Vercel ships the gateway side (kilo-auto/efficient, admin panel, token mint).
  2. First post-merge worker deploy runs the D1 migration via the new CI predeploy hook — CI's CLOUDFLARE_API_TOKEN needs D1 edit permission (the deploy will surface it if missing).
  3. Admin saves a benchmark config: benchmarkUserId is required for decider runs (consider a dedicated service account — its account is billed for CLI usage); suggested switchCostFactor starting value: 3.
  4. Trigger a classifier run and a decider run from the admin panel.
  5. Clear the leftover classifier_model KV override (currently set to flash-lite) if the benchmark winner should drive classifier selection.

Reviewer notes

  • The exact decider check also accepts the last non-empty output line (src/grading.ts): agent harnesses sometimes prepend preamble despite instructions; wrong answers fail either way.
  • @kilocode/cli@latest is resolved at image build time, i.e. each deploy pins whatever was latest then; re-deploy to pick up a newer CLI.
  • The token-mint endpoint is gated by INTERNAL_API_SECRET and can mint for any user id; scoping it to the configured benchmark user is a reasonable follow-up.
  • The decider benchmark exercises models through chat_completions only (the CLI's path). Config-save validation guarantees candidates are servable on all three chat API kinds, but accuracy is only measured on one.
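
The relaxed exact check from the first note can be sketched as below. `exactMatches` is a hypothetical name, and trimming whitespace before comparing is an assumption about src/grading.ts.

```typescript
// Passes when the whole output equals the expected answer, or when the
// last non-empty line does (tolerates preamble from the agent harness).
function exactMatches(output: string, expected: string): boolean {
  const want = expected.trim();
  if (output.trim() === want) return true;
  const lines = output
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  return lines.length > 0 && lines[lines.length - 1] === want;
}
```

A wrong final answer still fails, because only the last non-empty line is given the second chance.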

iscekic added 23 commits June 11, 2026 21:58
Mints a short-lived (6h) user API token for a given userId, guarded by the
shared internal secret over Authorization: Bearer. The decider benchmark uses
this to authenticate the kilo CLI against the gateway under a real user's
identity.
… container

The decider benchmark now executes each case through the stable kilo CLI
(@kilocode/cli) running in a Cloudflare Container, instead of bare OpenRouter
chat completions, so it measures the real agent harness.

- Container (Dockerfile + dependency-free server.mjs) spawns `kilo run
  --format json --auto` per case; the kilo user token is injected only as a
  child-process env var, never logged or written to disk.
- BenchRunnerContainer DO + wrangler containers/durable_objects/migrations.
- kilo-events.ts: pure parser for the CLI JSON event stream (text + cost),
  tolerant of both part.* and flattened event shapes.
- cli-runner.ts: proxies a case to the container and parses the result.
- run.ts: chunks decider cases (10/chunk) into per-(model,chunk) queue
  messages; fetches a short-lived user token once per message; fails fast when
  benchmarkUserId is unset (plus a defensive per-case guard). Classifier path
  unchanged.
- New benchmarkUserId config field (nullable) on BenchmarkConfig.
- vitest aliases @cloudflare/containers to a node-safe stub so unit tests can
  import the worker entry without the cloudflare:workers chain.
Adds a Benchmark user id input to the benchmark config editor (empty -> null),
with help text noting decider runs fail until it is set. Round-trips through
configToFormState/formStateToConfig.
…retries

- accept step_finish (underscore) events so per-case cost is summed
- retry once when a CLI session ends with no assistant text
- exact checks also accept the last non-empty output line
- uniform final-answer suffix on decider prompts
- /admin/debug-cli endpoint returning raw CLI events for diagnosis
@iscekic iscekic self-assigned this Jun 11, 2026
- serialize CLI runs per container and run decider cases sequentially
  (the CLI sqlite migration is unsafe under concurrent sessions)
- add dead-letter queue and raise container instance ceiling
- redact the kilo token from captured stderr before it leaves the container
- timing-safe secret comparison and tokenSource audit field on minted tokens
- validate persisted routing tables before serving them from the admin API
- regenerate worker types with the production web base URL
- dedupe the routing-table response schema; tier boundary tests
@iscekic iscekic marked this pull request as ready for review June 11, 2026 23:12
@iscekic iscekic force-pushed the feat/auto-routing-efficient-decision-engine branch from ca99949 to cac57b7 Compare June 11, 2026 23:12
Comment thread services/auto-routing-benchmark/container/server.mjs
Comment thread services/auto-routing-benchmark/src/index.ts
kilo-code-bot Bot commented Jun 11, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Executive Summary

All previously flagged issues are now resolved; the incremental commit correctly fixes the HTTP 400 precondition response for POST /admin/runs with no saved config.

Resolved Issues
| File | Issue | Status |
|---|---|---|
| services/auto-routing-benchmark/src/admin.ts | startRun null-config throw returned HTTP 500 instead of 400; user-facing precondition error misclassified as server fault | ✅ Fixed (165240b) |
| apps/web/src/lib/ai-gateway/auto-routing-mirror.ts | Stale comment referencing EfficientDecisionParams | ✅ Fixed (a449c26) |
| services/auto-routing-benchmark/src/run.ts | Redundant getRunState D1 round-trip | ✅ Fixed (82aef0b) |
| apps/web/src/app/api/internal/auto-routing-benchmark/token/route.ts | Local timingSafeStringEqual; now consolidated into @kilocode/encryption | ✅ Fixed |
| services/auto-routing-benchmark/src/ttl-cache.ts | Duplicated ttlCached utility; now promoted to @kilocode/worker-utils | ✅ Fixed |
| packages/auto-routing-contracts/src/benchmark.ts | ReasoningEffortSchema duplicated; now canonical in tiers.ts | ✅ Fixed |
| services/auto-routing-benchmark/container/server.mjs:109 | child.on('error') and child.on('close') both calling finish() without a guard | ✅ Fixed (ba3b3be) |
| services/auto-routing-benchmark/src/index.ts:29 | processJob unhandled throw crashing the queue handler | ✅ Fixed (ba3b3be) |
| services/auto-routing/src/classifier-config.ts | Missing .catch() guard on kvReadThrough | ✅ Fixed (01e4bd9) |
| services/auto-routing-benchmark/src/routing-table-builder.ts | Null-cost summaries excluded from tier ranking | ✅ Fixed (71222ca) |
| services/auto-routing-benchmark/src/db-schema.ts | Redundant idx_case_results_run index | ✅ Fixed (354054d) |
| packages/auto-routing-contracts/src/routing-table.ts | ClassifierApiKindSchema validation | ✅ Fixed |
| services/auto-routing-benchmark/container/server.mjs | Process-group kill on decider case timeout | ✅ Fixed (ae0cec5) |

Files Reviewed (incremental — 2 files)
  • services/auto-routing-benchmark/src/admin.ts — 0 issues (WARNING resolved)
  • services/auto-routing-benchmark/src/admin.test.ts — 0 issues
Previous Review Summary (commit 9eaae60)

Current summary above is authoritative. Previous snapshots are kept for context only.

Previous review (commit 9eaae60)

Status: 1 Issue Found | Recommendation: Address before merge

Executive Summary

POST /admin/runs still surfaces a user-facing precondition error (no config saved) as HTTP 500 instead of 400; all incremental commits — Morph provider removal, GLM oneOf schema-logging removal, kimi-k2.7-code thinking-only variants, and test stub hardening — are clean.

Overview

| Severity | Count |
|---|---|
| CRITICAL | 0 |
| WARNING | 1 |
| SUGGESTION | 0 |

Issue Details

WARNING

| File | Line | Issue |
|---|---|---|
| services/auto-routing-benchmark/src/admin.ts | 44 | startRun null-config throw returns HTTP 500 instead of 400; user-facing precondition error misclassified as server fault |

Resolved Issues (all fixed in prior commits)
| File | Issue | Status |
|---|---|---|
| apps/web/src/lib/ai-gateway/auto-routing-mirror.ts | Stale comment referencing EfficientDecisionParams | ✅ Fixed (a449c26) |
| services/auto-routing-benchmark/src/run.ts | Redundant getRunState D1 round-trip | ✅ Fixed (82aef0b) |
| apps/web/src/app/api/internal/auto-routing-benchmark/token/route.ts | Local timingSafeStringEqual; now consolidated into @kilocode/encryption | ✅ Fixed |
| services/auto-routing-benchmark/src/ttl-cache.ts (×2 copies) | Duplicated ttlCached utility; now promoted to @kilocode/worker-utils | ✅ Fixed |
| packages/auto-routing-contracts/src/benchmark.ts | ReasoningEffortSchema duplicated in benchmark and index; now canonical in tiers.ts | ✅ Fixed |
| services/auto-routing-benchmark/container/server.mjs:109 | child.on('error') and child.on('close') both calling finish() without a guard | ✅ Fixed (ba3b3be) |
| services/auto-routing-benchmark/src/index.ts:29 | processJob unhandled throw crashing the queue handler | ✅ Fixed (ba3b3be) |
| services/auto-routing/src/classifier-config.ts | Missing .catch() guard on kvReadThrough; benchmark failure could discard healthy admin override | ✅ Fixed (01e4bd9) |
| services/auto-routing-benchmark/src/routing-table-builder.ts | Null-cost summaries excluded from tier ranking | ✅ Fixed (71222ca) |
| services/auto-routing-benchmark/src/db-schema.ts | Redundant idx_case_results_run index; composite PK leftmost column already covers run_id prefix scans | ✅ Fixed (354054d) |
| packages/auto-routing-contracts/src/routing-table.ts | ClassifierApiKindSchema and per-candidate supportedApiKinds | ✅ Fixed: all-API-kinds validation now enforced at config save in benchmark-config/route.ts |
| services/auto-routing-benchmark/container/server.mjs | Process-group kill (detached + killProcessTree + child.on('exit') backstop) | ✅ Fixed (ae0cec5) |

Incremental Changes Reviewed (commits 1a5d858..9eaae60)
  • apps/web/src/lib/ai-gateway/providers/morph.ts — deleted. morph_warp_grep_free_model removed from kiloExclusiveModels in models.ts. 'morph' provider entry removed from provider-definitions.ts and types.ts. forbidden-free-models.ts correctly adds 'morph-warp-grep-v2' per the AGENTS.md rule for removed free models. The remaining 'morph' entries in inference-provider-id.ts are OpenRouter's third-party inference network identifiers — correct to leave. Clean.
  • apps/web/src/lib/ai-gateway/schema-logging.ts — deleted. GLM oneOf schema diagnostic logger removed (reverted from glm-logging feature). applyProviderSpecificLogic signature drops organizationId parameter, call site in route.ts updated accordingly. Clean.
  • apps/web/src/lib/ai-gateway/providers/model-settings.ts — extracts REASONING_VARIANTS_THINKING_ONLY (thinking-only, no instant variant) and adds a specific check for kimi-k2.7-code before the isKimiModel catch-all, correctly giving this model thinking-only variants. Ordering is correct. Clean.
  • apps/web/src/app/admin/api/auto-routing/benchmark-config/route.test.ts / model-api-kinds.test.ts / openrouter/index.test.ts — tests updated to replace morph_warp_grep_free_model with stable stub models, eliminating dependency on removed provider file. Clean.
  • apps/web/src/components/shared/ModelCombobox.tsx — visual tweak to free-model badge (ghost style with ring). UI-only. Clean.
  • apps/web/src/lib/ai-gateway/forbidden-free-models.ts — adds 'morph-warp-grep-v2' to the forbidden set. Clean.
Files Reviewed
  • services/auto-routing-benchmark/src/admin.ts — 1 issue (carried forward)
  • apps/web/src/lib/ai-gateway/providers/morph.ts (deleted) — 0 issues
  • apps/web/src/lib/ai-gateway/schema-logging.ts (deleted) — 0 issues
  • apps/web/src/lib/ai-gateway/providers/apply-provider-specific-logic.ts — 0 issues
  • apps/web/src/lib/ai-gateway/providers/model-settings.ts — 0 issues
  • apps/web/src/lib/ai-gateway/providers/provider-definitions.ts — 0 issues
  • apps/web/src/lib/ai-gateway/providers/types.ts — 0 issues
  • apps/web/src/lib/ai-gateway/models.ts — 0 issues
  • apps/web/src/lib/ai-gateway/forbidden-free-models.ts — 0 issues
  • apps/web/src/app/api/openrouter/[...path]/route.ts — 0 issues
  • apps/web/src/components/shared/ModelCombobox.tsx — 0 issues
  • apps/web/src/app/admin/api/auto-routing/benchmark-config/route.test.ts — 0 issues
  • apps/web/src/lib/ai-gateway/model-api-kinds.test.ts — 0 issues
  • apps/web/src/lib/ai-gateway/providers/openrouter/index.test.ts — 0 issues

Reviewed by claude-4.6-sonnet-20260217 · 415,474 tokens

Review guidance: REVIEW.md from base branch main

Comment thread services/auto-routing-benchmark/src/admin.ts
iscekic added 11 commits June 12, 2026 17:01
…nomy coverage

Grow the decider benchmark from 30 to 76 cases so every
(taskType, subtaskType) pair in the classifier taxonomy has at least
4 mechanically-checkable cases, with at least 20 cases per difficulty
tier (23 low / 31 medium / 22 high).

- DeciderCase gains subtaskType; ids follow the
  <taskType>-<subtype>-<topic> scheme used by the classifier dataset
- Existing cases retagged with subtypes where they genuinely fit
  (three system-behavior investigation cases moved to
  planning_design/system_design, the HTTP 201 lookup to
  investigation/external_research, and the let-closure case reframed
  as refactoring/migration)
- New agentic_execution cases are self-contained file/terminal tasks
  deterministic in the node:22-slim container
- Tests now enforce per-pair and per-tier quotas from the
  classifierTaxonomy export, subtype/taskType consistency, regex
  compilability, and json_equal round-tripping
Remember the last served model per conversation in the decision-cache DO
and keep it while it meets the current tier's accuracy threshold, unless
the fresh pick is cheaper by more than the routing table's new
switchCostFactor. Switching models discards provider prompt caches, so a
session whose difficulty tier oscillates no longer ping-pongs between
models. Decisions report a sticky flag in the response and the
auto_routing_decision log line.
…runs, and routing table

Store the new BenchmarkConfig.switchCostFactor in the benchmark_config
singleton, snapshot it into benchmark_runs at startRun, and carry the
run's snapshotted value into published routing tables so the schema's
required RoutingTableSchema.switchCostFactor parses on read. Regenerate
the squashed D1 baseline migration, add a Switch cost factor field to
the admin config form, and update test fixtures (including the apps/web
decision fixtures missing the new required sticky flag).
…e at config save

All decider candidates are served via providers that speak every gateway
chat API (in practice OpenRouter), so per-candidate supportedApiKinds was
dead weight in the contracts, decision engine, D1 schema, and routing
table. The one real failure mode - an admin configuring a model whose
serving provider is chat-completions-only - is now rejected at config
save time instead.
- never let a heuristic fallback classification re-anchor the session's
  sticky model (same trust rule as the classification cache)
- drop the dead ClassifierApiKindSchema export
- rename the decider pages-helper case so its id no longer collides with
  the classifier dataset's debug-fix-pagination-slice in shared telemetry
- trim a stale JSDoc in model-api-kinds.ts
iscekic added 11 commits June 12, 2026 21:06
- Inject KILO_API_URL into the benchmark container via a new
  KILO_CLI_API_URL worker var so the kilo CLI targets the same gateway
  the worker mints tokens against (prod default: api.kilo.ai).
- Add .dev.vars.example mapping both URLs to the local apps/web dev
  server (worker-side localhost, container-side host.docker.internal).
- Add AUTO_ROUTING_BENCHMARK_WORKER_URL to the apps/web env example so
  the admin panel proxies to the local benchmark worker instead of prod.
- Work around wrangler force-pulling the amd64 container egress proxy
  on Apple Silicon (its transparent-proxy setsockopt crashes under
  emulation, failing every local container start) by pinning the arm64
  manifest digest via MINIFLARE_CONTAINER_EGRESS_IMAGE in the dev
  runner.
…meout

The kilo bin is a Node wrapper that spawns the real CLI binary as a
grandchild. SIGKILLing only the wrapper orphaned the grandchild on
timeout: it kept running (and spending) and held the stdout/stderr
pipes open, so 'close' never fired, the case promise never resolved,
and the chunk's queue message hung until the runtime cut it — then
retried from case 0 and eventually dead-lettered. Observed live: a
runaway agentic case ran 20+ minutes past the 180s cap and wedged the
whole run.

Spawn the CLI detached so it leads its own process group, kill the
group on timeout, and add an after-exit grace backstop so a stray
pipe-holder can never hang a case again.
…r latency gate

- Config gains classifierRepetitions, deciderRepetitions (1-5), and
  classifierMaxP95LatencyMs (null = no constraint); run rows snapshot the
  active repetition count and latency budget at start time.
- case_results PK extended with rep column; timed_out column added.
- model_summaries gains p95_latency_ms (nearest-rank p95 over all rows)
  and timeouts count.
- pickClassifierWinner enforces an optional p95 latency budget: candidates
  meeting both accuracy and latency are ranked by cost; when none meet the
  budget, falls back to lowest-p95 among accuracy-meeting models.
- classifier_winner contract surfaces the winner's p95LatencyMs.
- DECIDER_CHUNK_SIZE reduced from 10 to 5 to stay well within queue
  consumer wall-clock limits.
- Container server propagates timedOut flag through ContainerRunResponse
  and CliRunResult so timed-out cases are recorded in D1.
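
For reference, a nearest-rank percentile as named in the model_summaries bullet can be computed as in this sketch. `nearestRankPercentile` is a hypothetical helper, not the function used in the worker.

```typescript
// Nearest-rank percentile: the smallest value such that at least
// `percentile` percent of the samples are less than or equal to it.
function nearestRankPercentile(values: number[], percentile: number): number | null {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  // 1-based rank; multiply before dividing to limit floating-point drift.
  const rank = Math.ceil((percentile * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}
```

For 20 latency samples the p95 is the 19th smallest value; for 10 samples it is the largest one.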
…test gaps

- Migration 0001: replace "rep"/"timed_out" column refs in INSERT...SELECT
  with literal 0,0 — old table lacks those columns; D1 silently degrades
  double-quoted unknowns to string literals, corrupting NOT NULL integer rows.
- Contracts: add BenchmarkConfigSchema defaults test (classifierRepetitions=1,
  deciderRepetitions=1, classifierMaxP95LatencyMs=1000 when omitted).
- Benchmark: extract buildDeciderMessages() pure function; add fan-out test
  asserting models × reps × ceil(76/5) messages each carrying the correct rep.
…olumns

Add classifier/decider repetitions (1–5) and classifierMaxP95LatencyMs
inputs to the Benchmark Config card; add p95 latency and Timeouts
columns to the run summaries table; update test fixtures with new fields.
Set both RunSummariesTable colSpan values back to 6 to match the outer
BenchmarkRunsTable's 6-column header (chevron, Kind, Status, Started,
Completed, Error). Export configToFormState and formStateToConfig for
unit testing and add focused tests covering null-config defaults,
round-trip preservation of repetitions/latency fields, and empty-string
classifierMaxP95LatencyMs coercing to null.
…ests

Main merged PR #4004 which deleted the morph provider. The two test files
that exercised the rejection branch of modelServesAllGatewayChatApis used
morph as the only available Kilo-exclusive model on a chat_completions-only
gateway. With morph gone, no real catalog entry satisfies that condition.

Both test files now stub findKiloExclusiveModel via jest.mock/requireActual
so that the marker id 'test-exclusive/alibaba-only' returns a KiloExclusiveModel
with gateway: 'alibaba'. The real PROVIDERS.ALIBABA definition supports only
chat_completions, so the rejection path is exercised without relying on any
specific provider file being present in the catalog.
…onfig

The POST /admin/runs handler let startRun's "config not set" precondition
error propagate to the global error handler, surfacing a client-side
precondition as HTTP 500. Guard the null config in the route handler,
mirroring the /admin/debug-cli pattern, and return 400 instead.